Word Root Finder: a Morphological Segmentor Based on CRF

Authors

  • Joseph Z. Chang
  • Jason S. Chang
Abstract

Morphological segmentation of words is a subproblem of many natural language tasks, including handling out-of-vocabulary (OOV) words in machine translation, more effective information retrieval, and computer-assisted vocabulary learning. Previous work typically relies on extensive statistical and semantic analyses to induce legitimate stems and affixes. We introduce a new learning-based method and a prototype implementation of a knowledge-light system for learning to segment a given word into word parts, including prefixes, suffixes, stems, and even roots. The method is based on the Conditional Random Fields (CRF) model. Evaluation results show that our method, with a small set of seed training data and readily available resources, can produce fine-grained morphological segmentation results that rival previous work and systems.
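The abstract does not spell out how segmentation is encoded for the CRF. One common way to cast segmentation as sequence labeling, shown here only as a minimal sketch, is to tag each character as beginning (B) or continuing (I) a word part. The B/I scheme, the feature names, the function names, and the example word are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List


def segments_to_labels(segments: List[str]) -> List[str]:
    """Label each character: 'B' starts a word part, 'I' continues it."""
    labels: List[str] = []
    for part in segments:
        if not part:
            raise ValueError("word parts must be non-empty")
        labels.extend(["B"] + ["I"] * (len(part) - 1))
    return labels


def labels_to_segments(word: str, labels: List[str]) -> List[str]:
    """Rebuild word parts from per-character B/I labels."""
    if len(word) != len(labels):
        raise ValueError("need exactly one label per character")
    segments: List[str] = []
    for ch, label in zip(word, labels):
        if label == "B" or not segments:
            segments.append(ch)   # open a new word part
        else:
            segments[-1] += ch    # extend the current word part
    return segments


def char_features(word: str, i: int) -> Dict[str, str]:
    """Hypothetical local features for the character at index i.

    A CRF toolkit would consume one such dict per character; the
    feature set here is illustrative, not the paper's.
    """
    padded = f"^{word}$"          # boundary markers for start and end
    j = i + 1                     # index of word[i] inside `padded`
    return {
        "char": word[i],
        "prev": padded[j - 1],
        "next": padded[j + 1],
        "prev_bigram": padded[j - 1:j + 1],
        "next_bigram": padded[j:j + 2],
    }


if __name__ == "__main__":
    parts = ["trans", "port", "ation"]   # illustrative segmentation
    word = "".join(parts)
    labels = segments_to_labels(parts)
    print(list(zip(word, labels)))
    print(labels_to_segments(word, labels))
    print(char_features(word, 5))
```

Under this encoding, a trained sequence model predicts one label per character, and `labels_to_segments` turns the predicted labels back into word parts. Finer-grained labels (for example, separate tags for prefixes, roots, and suffixes) would follow the same pattern.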

Similar resources

A Unicode Based Adaptive Segmentor

This paper presents a Unicode-based Chinese word segmentor. It can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses a divide-and-conquer strategy to handle the recognition of personal names, numbers, time and numerical values, etc., in the preprocessing stage. The segmentor further uses tagging information to work on disambiguation. Adopting a modular design app...

Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff

This paper describes a Chinese word segmentor (CWS) based on the backward maximum matching (BMM) technique for the 2nd Chinese Word Segmentation Bakeoff in the Microsoft Research (MSR) closed testing track. Our CWS comprises a context-based Chinese unknown word identifier (UWI). All the context-based knowledge for the UWI is generated fully automatically from the MSR training corpus. According to...

BMM-Based Chinese Word Segmentor with Word Support Model for the SIGHAN Bakeoff 2006

This paper describes a Chinese word segmentor (CWS) for the third International Chinese Language Processing Bakeoff (SIGHAN Bakeoff 2006). We participated in the word segmentation task in the Microsoft Research (MSR) closed testing track. Our CWS is based on backward maximum matching with a word support model (WSM) and context-based Chinese unknown word identification. From the scored results a...

Chinese Segmentation with a Word-Based Perceptron Algorithm

Standard approaches to Chinese word segmentation treat the problem as a tagging task, assigning labels to the characters in the sequence indicating whether the character marks a word boundary. Discriminatively trained models based on local character features are used to make the tagging decisions, with Viterbi decoding finding the highest scoring segmentation. In this paper we propose an altern...

From “Manbearpig” to “Man bear pig”: An Evaluation of Unsupervised Word Segmentation Algorithms

In this paper, we explore diverse methods of unsupervised morphemic segmentation. We test Successor and Predecessor Count algorithms, Entropy algorithms, and Affix Discovery algorithms. The paper examines word stemming based on these algorithms, and the influence of training corpus size on segmentation accuracy. We propose variations on these algorithms to improve overall efficacy. While these ...

Publication date: 2012